EXPLORING A DATASET - WHITE WINE

INTRODUCTION TO DATASET

This report explores a dataset containing an expert quality assessment and chemical composition data (11 variables) for 4,898 white wines (all Portuguese ‘Vinho Verde’).

Univariate Plots Section

## [1] "whitewine dataframe"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Preliminary assessment of the data suggests that the quality score may often be usefully treated as a factor variable, as it only takes integer scores (and only scores from 3-9 are seen - so it is effectively a 7 point scale for this dataset). An additional variable (quality.factor) is introduced for when a factor scale is more useful.
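The code that builds quality.factor is not shown in this report, so the following is only a minimal sketch of one plausible construction. The toy data frame and the choice of an ordered factor with levels 3-9 are assumptions made for illustration.

```r
# Toy stand-in for the report's `whitewine` data frame (illustrative values only).
whitewine <- data.frame(quality = c(6L, 5L, 7L, 3L, 9L, 6L, 8L, 4L))

# Assumed construction: an ordered factor covering the observed range 3-9,
# so that every score keeps its own level in tables and plots.
whitewine$quality.factor <- factor(whitewine$quality, levels = 3:9, ordered = TRUE)

# Count of wines per quality score.
quality.counts <- table(whitewine$quality.factor)
print(quality.counts)
```

Fixing the levels explicitly means a subset with no wines at a given score still reports that score with a count of 0.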

A quick view shows that quality scores range from 3-9, with the majority between 5-7. Scores show a roughly normal distribution. It is also worth noting that there are very few wines with the lowest or highest quality scores of 3 (20 wines) or 9 (5 wines).

Residual sugar, when plotted on a log_10 scale, shows a bimodal distribution, with peaks at c. 1.3 and 10 (red lines). White wines are commonly regarded as either ‘dry’ or ‘sweet’ - so perhaps it is worth splitting the white wines into these two categories for the analysis, split at around residual.sugar = 3 (the orange line)?
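The plotting code is not included in this report. As a rough sketch of the idea, assuming base R graphics and a small made-up vector in place of whitewine$residual.sugar, a log_10 histogram with the proposed split line could be drawn like this:

```r
# Illustrative values only; the report uses whitewine$residual.sugar.
residual.sugar <- c(1.2, 1.3, 1.5, 1.6, 2.0, 6.9, 8.5, 10.0, 12.4, 20.7)

# Bin on the log10 scale so the low-sugar and high-sugar groups are both resolved.
log.sugar <- log10(residual.sugar)
h <- hist(log.sugar, breaks = 20, plot = FALSE)

# Position of the proposed dry / sweet split (residual.sugar = 3) on the log10 axis.
split.at <- log10(3)

plot(h, main = "residual.sugar (log10 scale)", xlab = "log10(residual.sugar)")
abline(v = split.at, col = "orange")
```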

# creates new sweetness variable based on residual.sugar content
# (wines with residual.sugar >= 3 are labelled 'sweet', the rest 'dry')
whitewine$sweetness <- factor(ifelse(
                              whitewine$residual.sugar >= 3, 'sweet', 'dry'))

# also adds new dataframes of only sweet or dry white wines
whitewine.sweet <- whitewine[whitewine$sweetness == 'sweet',]
whitewine.dry <- whitewine[whitewine$sweetness == 'dry',]

I have created a new variable ‘sweetness’, with values ‘dry’ and ‘sweet’.
NOTE Technically this is probably a bivariate plot - I only include it here to clarify the split I have made in the data at this point, as it features in future analysis.

## [1] "whitewine.sweet dataframe"
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0900   Min.   :0.0000  
##  1st Qu.:1300   1st Qu.: 6.400   1st Qu.:0.2200   1st Qu.:0.2600  
##  Median :2506   Median : 6.800   Median :0.2700   Median :0.3100  
##  Mean   :2492   Mean   : 6.874   Mean   :0.2871   Mean   :0.3364  
##  3rd Qu.:3716   3rd Qu.: 7.300   3rd Qu.:0.3300   3rd Qu.:0.3900  
##  Max.   :4895   Max.   :11.800   Max.   :1.1000   Max.   :1.2300  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 3.000   Min.   :0.01400   Min.   :  2.00     
##  1st Qu.: 6.000   1st Qu.:0.03700   1st Qu.: 27.00     
##  Median : 8.400   Median :0.04400   Median : 37.00     
##  Mean   : 9.335   Mean   :0.04674   Mean   : 38.68     
##  3rd Qu.:12.400   3rd Qu.:0.05100   3rd Qu.: 50.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :131.00     
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   : 18.0        Min.   :0.9887   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:118.0        1st Qu.:0.9936   1st Qu.:3.080   1st Qu.:0.4100  
##  Median :149.0        Median :0.9955   Median :3.160   Median :0.4700  
##  Mean   :149.6        Mean   :0.9954   Mean   :3.171   Mean   :0.4838  
##  3rd Qu.:179.0        3rd Qu.:0.9975   3rd Qu.:3.250   3rd Qu.:0.5400  
##  Max.   :366.5        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##                                                                        
##     alcohol         quality      quality.factor sweetness   
##  Min.   : 8.00   Min.   :3.000   3:  12         dry  :   0  
##  1st Qu.: 9.30   1st Qu.:5.000   4:  78         sweet:3030  
##  Median : 9.90   Median :6.000   5: 980                     
##  Mean   :10.23   Mean   :5.844   6:1371                     
##  3rd Qu.:11.00   3rd Qu.:6.000   7: 480                     
##  Max.   :14.05   Max.   :9.000   8: 107                     
##                                  9:   2
## [1] "whitewine.dry dataframe"
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   2   Min.   : 4.200   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1145   1st Qu.: 6.200   1st Qu.:0.1900   1st Qu.:0.2700  
##  Median :2351   Median : 6.700   Median :0.2500   Median :0.3200  
##  Mean   :2381   Mean   : 6.823   Mean   :0.2638   Mean   :0.3307  
##  3rd Qu.:3593   3rd Qu.: 7.300   3rd Qu.:0.3100   3rd Qu.:0.3800  
##  Max.   :4898   Max.   :14.200   Max.   :1.0050   Max.   :1.6600  
##                                                                   
##  residual.sugar    chlorides       free.sulfur.dioxide
##  Min.   :0.600   Min.   :0.00900   Min.   :  3.00     
##  1st Qu.:1.200   1st Qu.:0.03400   1st Qu.: 19.00     
##  Median :1.500   Median :0.04000   Median : 28.00     
##  Mean   :1.617   Mean   :0.04421   Mean   : 29.84     
##  3rd Qu.:1.900   3rd Qu.:0.04800   3rd Qu.: 38.00     
##  Max.   :2.900   Max.   :0.27100   Max.   :289.00     
##                                                       
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.740   Min.   :0.2500  
##  1st Qu.: 96.0        1st Qu.:0.9906   1st Qu.:3.100   1st Qu.:0.4100  
##  Median :117.0        Median :0.9918   Median :3.210   Median :0.4800  
##  Mean   :120.1        Mean   :0.9918   Mean   :3.216   Mean   :0.4997  
##  3rd Qu.:142.2        3rd Qu.:0.9930   3rd Qu.:3.320   3rd Qu.:0.5600  
##  Max.   :440.0        Max.   :0.9980   Max.   :3.810   Max.   :1.0600  
##                                                                        
##     alcohol         quality      quality.factor sweetness   
##  Min.   : 8.00   Min.   :3.000   3:  8          dry  :1868  
##  1st Qu.:10.10   1st Qu.:5.000   4: 85          sweet:   0  
##  Median :10.90   Median :6.000   5:477                      
##  Mean   :10.97   Mean   :5.933   6:827                      
##  3rd Qu.:11.80   3rd Qu.:7.000   7:400                      
##  Max.   :14.20   Max.   :9.000   8: 68                      
##                                  9:  3

Let’s look at quality again.

The quality distributions of sweet and dry white wines appear similar.

Fixed acidity shows a relatively narrow normal distribution about a median of 6.8, with no noticeable sweet / dry difference. Black line is median fixed.acidity of all wines.

Volatile acidity shows a slightly right skewed normal distribution about a median of 0.26, with a few outliers at >0.8. Black line is median volatile.acidity of all wines.

Citric acid shows a generally normal distribution, with an odd peak at c. 0.49, and a smaller one at c. 0.74 (red lines). Black line is median citric.acid of all wines.

A ‘zoom’ into the histogram in this region shows that there certainly appears to be a local ‘spike’ in the data at citric.acid = 0.49. I wonder if there is some form of target / guideline to aim for ‘below 0.5’ for citric.acid during the wine making process? (Could this be Goodhart’s Law in action?). Or alternatively, could this be some measurement artefact?

Zooming in again, there is a similar (though smaller) local ‘spike’ at 0.74 - again, is there some effect clustering values below a ‘round’ value of 0.75?

There appears to be some effect causing a local spike in citric.acid at 0.49 and 0.74, just below the ‘round’ values of 0.5 and 0.75. My suspicion at this point is that 0.5 and 0.75 could be some form of ‘target levels’ which winemakers aim to be below.
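One simple way to check a suspected spike numerically, rather than by eye, is to compare the count at that value with the counts at its immediate neighbours. This is only a sketch: the vector below is made up for illustration, and `count.at` is a hypothetical helper, not part of the original analysis.

```r
# Illustrative values only; the report uses whitewine$citric.acid.
citric.acid <- c(0.32, 0.36, 0.40, 0.48, 0.49, 0.49, 0.49, 0.50, 0.51, 0.74, 0.74, 0.75)

# Hypothetical helper: number of observations within half a recording step
# (values are recorded to 2 decimal places) of a target value.
count.at <- function(x, value, tol = 0.005) sum(abs(x - value) < tol)

# Counts just below, at, and just above the suspected spike at 0.49.
spike.check <- sapply(c(0.48, 0.49, 0.50), function(v) count.at(citric.acid, v))
names(spike.check) <- c("0.48", "0.49", "0.50")
print(spike.check)
```

A count at 0.49 that is well above both neighbours would support the impression from the histogram; the same check can be repeated around 0.74.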

Chlorides show a tight normal distribution about a median of 0.043, with a few outliers above 0.1. Here the dry wines show slightly lower levels of chlorides than sweet wines. Black line is median chlorides of all wines.

Free sulfur dioxide shows a (very) slightly right skewed distribution about a median of 34.0. There is a noticeable difference here between sweet and dry, with dry wines showing lower free.sulfur.dioxide. Black line is median free.sulfur.dioxide of all wines.

Total sulfur dioxide shows a more symmetrical normal distribution about a median of 134.0. Again, there is a noticeable difference here between sweet and dry, with dry wines showing lower total.sulfur.dioxide. Black line is median total.sulfur.dioxide of all wines.

Density shows a normal distribution in a very tight range (mostly 0.99 < density < 1.00). A marked difference between sweet and dry is clear: dry wines are lower density. Black line is median density of all wines.

pH shows a ‘very’ neat normal distribution around a median (all wines) of 3.18 (black line).

Sulphates show a slightly right skewed distribution around a median (all wines) of 0.47.

Alcohol shows a more ‘spread’ distribution, with a median of 10.4, and a clear difference between sweet and dry wines. Let’s look at them separately.

Dry wines now show a more symmetrical normal distribution, while sweet wines show a noticeable ‘right skew’. Black line is median alcohol of all wines.

Univariate Analysis

What is the structure of your dataset?

The dataset contains 11 chemical parameters and an expert quality assessment (on a scale of 1-10) for 4,898 white wines. All of the chemical parameters are measurements on a continuous scale. The quality score takes integer values only, and only scores from 3-9 are observed, so for some aspects of this analysis it makes sense to consider quality as an ordered factor variable.

Most wines have a quality score of 5-7, with few scoring either very high (quality.factor = 9) or very low (quality.factor = 3) quality scores.

Most of the chemical parameters show normal or slightly skewed distributions, with a few exceptions worth noting:

  • citric.acid has two ‘local’ maxima / spikes in distribution at 0.49 and 0.74
  • residual.sugar (viewed on a log scale) shows a bimodal distribution
  • alcohol shows a broad distribution

What is/are the main feature(s) of interest in your dataset?

Based on the bimodal distribution of residual.sugar, and knowledge that white wines are conventionally classified as ‘sweet’ or ‘dry’ based on sugar / sweetness, it makes sense to classify the population into sweet and dry wines (at residual.sugar >= 3 or < 3), and see how the analyses & correlations vary between these two sets. The sulfur.dioxide variables (free. and total.) both show differences between sweet and dry wines, as do density and alcohol.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I certainly think it will be useful to consider the difference between the sweet and dry wines for future analysis. At this stage it is hard to know what variables or correlations will prove most useful.

Did you create any new variables from existing variables in the dataset?

It appears worthwhile to add a ‘quality.factor’ variable - for views where quality is more usefully considered as a factor in subsequent plots.

I also added a ‘sweetness’ variable, to split the data up into ‘sweet’ and ‘dry’ wines (at residual.sugar >= 3 or < 3).

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I used a log_10 scale for the residual sugar to highlight the bimodality of the distribution, and enable the splitting between sweet and dry.

Bivariate Plots Section

Let’s start by looking at the correlation matrix for the whole dataset.

An initial look at a correlation matrix indicates a few areas of interest:

I thought it might also be interesting to see where the correlation matrices show the greatest differences when evaluated separately for the ‘sweet’ and ‘dry’ white wines:

# calculating the 'difference' between the correlation matrices for 
# sweet and dry wines
m.sweet <- cor(whitewine.sweet[c(13, 2:12)])
m.dry <- cor(whitewine.dry[c(13, 2:12)])

m.diff <-  m.sweet - m.dry

m.diff
##                           quality fixed.acidity volatile.acidity
## quality               0.000000000   0.103847402      0.060402068
## fixed.acidity         0.103847402   0.000000000      0.086271067
## volatile.acidity      0.060402068   0.086271067      0.000000000
## citric.acid          -0.097525268   0.003264288      0.192705704
## residual.sugar       -0.319537384   0.153287995     -0.140546252
## chlorides            -0.052932297  -0.019885777      0.023636359
## free.sulfur.dioxide  -0.219418791   0.089155598     -0.008518115
## total.sulfur.dioxide -0.139796651   0.023420972     -0.005980721
## density               0.079336798  -0.140449627     -0.051435814
## pH                   -0.108615676   0.078145055      0.038409079
## sulphates            -0.185374791   0.039546297      0.207529360
## alcohol               0.003887193   0.077513043      0.106199309
##                       citric.acid residual.sugar   chlorides
## quality              -0.097525268    -0.31953738 -0.05293230
## fixed.acidity         0.003264288     0.15328800 -0.01988578
## volatile.acidity      0.192705704    -0.14054625  0.02363636
## citric.acid           0.000000000     0.19792871 -0.29290577
## residual.sugar        0.197928706     0.00000000  0.13804589
## chlorides            -0.292905769     0.13804589  0.00000000
## free.sulfur.dioxide   0.177508348     0.14656752  0.08945506
## total.sulfur.dioxide  0.155075239     0.15423363  0.08997537
## density               0.090428010     0.83819069 -0.07232324
## pH                    0.024615905    -0.21846280  0.02422776
## sulphates             0.011457022    -0.09799763  0.04519596
## alcohol              -0.097298545    -0.66917474  0.03628726
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## quality                     -0.219418791         -0.139796651  0.07933680
## fixed.acidity                0.089155598          0.023420972 -0.14044963
## volatile.acidity            -0.008518115         -0.005980721 -0.05143581
## citric.acid                  0.177508348          0.155075239  0.09042801
## residual.sugar               0.146567517          0.154233627  0.83819069
## chlorides                    0.089455056          0.089975367 -0.07232324
## free.sulfur.dioxide          0.000000000          0.101141224  0.31209367
## total.sulfur.dioxide         0.101141224          0.000000000  0.13002179
## density                      0.312093672          0.130021789  0.00000000
## pH                          -0.126954901         -0.088100208 -0.17722298
## sulphates                   -0.035318735          0.054952867 -0.01359132
## alcohol                     -0.300623358         -0.217728232  0.05682155
##                               pH   sulphates      alcohol
## quality              -0.10861568 -0.18537479  0.003887193
## fixed.acidity         0.07814506  0.03954630  0.077513043
## volatile.acidity      0.03840908  0.20752936  0.106199309
## citric.acid           0.02461591  0.01145702 -0.097298545
## residual.sugar       -0.21846280 -0.09799763 -0.669174742
## chlorides             0.02422776  0.04519596  0.036287255
## free.sulfur.dioxide  -0.12695490 -0.03531873 -0.300623358
## total.sulfur.dioxide -0.08810021  0.05495287 -0.217728232
## density              -0.17722298 -0.01359132  0.056821549
## pH                    0.00000000 -0.07551512  0.095147122
## sulphates            -0.07551512  0.00000000 -0.047923762
## alcohol               0.09514712 -0.04792376  0.000000000

If we look for the largest differences:

It is probably not surprising that the greatest differences are in correlations involving residual.sugar, as this is the dimension we have used to ‘split’ the dataset. But it will be interesting to view these different relationships across sweet vs. dry wines in the analysis below.
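The code used to rank the differences is not shown in this report. A minimal sketch of one way to do it, using a 3 x 3 excerpt of the m.diff output above (values copied from that output) and the assumed name `pair.diffs`, is:

```r
# Excerpt of m.diff for three variables (values copied from the output above).
vars <- c("residual.sugar", "density", "alcohol")
m.diff <- matrix(c(0,           0.83819069, -0.66917474,
                   0.83819069,  0,           0.05682155,
                   -0.66917474, 0.05682155,  0),
                 nrow = 3, dimnames = list(vars, vars))

# The matrix is symmetric, so keep each variable pair once (upper triangle only).
idx <- which(upper.tri(m.diff), arr.ind = TRUE)
pair.diffs <- data.frame(var1 = rownames(m.diff)[idx[, "row"]],
                         var2 = colnames(m.diff)[idx[, "col"]],
                         diff = m.diff[idx])

# Sort by absolute difference, largest first.
pair.diffs <- pair.diffs[order(-abs(pair.diffs$diff)), ]
print(pair.diffs)
```

Applied to the full m.diff, the same steps would list all 66 variable pairs in order of how much their correlation differs between sweet and dry wines.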

A review of the ‘pair plot’ matrix (split into 2, and using quality as a factor in each), highlights a few features worth investigating:

Let’s look at the boxplots vs. quality.factor in more detail.

NOTE It is important to remember in considering the plots below that only a small number of wines show quality scores of 9 (5 wines) or 3 (20 wines). So interpretation of the plots should focus principally on the quality.factor range 4-8.

Looking again (at a slightly greater scale) at boxplots of the various chemical parameters across quality scores. An initial look suggests that density, alcohol and chlorides show the clearest and most consistent trends with quality. It is worth examining some of these parameters more closely, to see how the quality relationship might vary between sweet and dry white wines.

Density shows a clear negative trend with quality for both dry and sweet, but the variation across quality is more pronounced for the sweet wines. And there is a clear variation between optimal densities for sweet vs. dry white wines.
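The boxplot comparison can also be summarised numerically. The sketch below uses a made-up data frame (the column names match the report, the values do not) to compute the median density per quality score and sweetness group, and then the gap between quality 4 and quality 8 within each group.

```r
# Illustrative values only; the report uses the full whitewine data frame.
wines <- data.frame(
  quality   = c(4, 4, 8, 8, 4, 4, 8, 8),
  sweetness = c("dry", "dry", "dry", "dry", "sweet", "sweet", "sweet", "sweet"),
  density   = c(0.9930, 0.9934, 0.9900, 0.9904, 0.9970, 0.9976, 0.9920, 0.9926)
)

# Median density for each combination of quality score and sweetness group.
med <- aggregate(density ~ quality + sweetness, data = wines, FUN = median)
print(med)

# Drop in median density from quality 4 to quality 8, per sweetness group.
gap <- med$density[med$quality == 4] - med$density[med$quality == 8]
names(gap) <- med$sweetness[med$quality == 4]
print(gap)
```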

Alcohol shows a broadly positive correlation with quality for both sweet and dry wines, but the relationship appears more pronounced for sweet wines.

Chlorides do indeed show a (negative) correlation with quality, at slightly lower levels for dry wines than for sweet.

As the basis for the separation between dry & sweet, clearly residual sugar shows very different quality variation across the two groups. The trend for sweet white wines is slightly negative: higher quality wines have less residual.sugar. However for dry white wines, the trend is reversed - with higher quality wines showing slightly higher levels of residual.sugar.

Total.sulfur.dioxide shows a clear negative trend against increasing quality for sweet wines, but a much less pronounced trend for dry wines.

The trend of sulphates with quality is slight, but does appear to be positive for dry wines and negative for sweet wines.

There is little consistent variation in pH with quality for sweet wines, but a very clear positive correlation for dry wines (higher pH means better wine).

Alcohol and density show a clear negative correlation, and also a clear separation between sweet and dry wines (dry wines generally having lower density for a given alcohol level).

There is a clear positive correlation between residual.sugar and density for sweet wines, but no indication of any significant correlation for dry wines (albeit with a restricted range of residual.sugar).

Fixed.acidity shows a slight positive correlation with density, and also a reasonable separation between sweet and dry wines (dry wines are generally lower density for a given fixed.acidity).

The trends for residual.sugar vs. alcohol are less clear, but regression lines show different slope directions in sweet vs. dry wines.
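To compare slope directions without relying on the plot, a separate linear fit can be run for each sweetness group. This is a sketch under assumptions: the data frame is made up, so the signs it produces illustrate the method only and are not the report's results.

```r
# Illustrative values only; the report uses whitewine$residual.sugar,
# whitewine$alcohol and whitewine$sweetness.
wines <- data.frame(
  residual.sugar = c(1.0, 1.5, 2.0, 2.5, 4.0, 8.0, 12.0, 16.0),
  alcohol        = c(10.0, 10.5, 11.0, 11.5, 11.5, 10.5, 9.5, 8.5)
)
wines$sweetness <- factor(ifelse(wines$residual.sugar >= 3, "sweet", "dry"))

# Fit alcohol ~ residual.sugar within each sweetness group and keep the slope.
slopes <- sapply(split(wines, wines$sweetness), function(d) {
  coef(lm(alcohol ~ residual.sugar, data = d))[["residual.sugar"]]
})
print(slopes)
```

Slopes of opposite sign in the two groups correspond to regression lines that tilt in different directions on the scatterplot.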

There is some positive correlation between total.sulfur.dioxide and free.sulfur.dioxide, but no clear separation between dry and sweet wines.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

It is now clear from this analysis that it does make sense to consider the data for sweet and dry white wines separately when looking at some of the relationships. The strength / level of effect of certain variables on quality can vary between the two groups, and may even be in the reverse direction.

  • Density shows a clear negative correlation with quality (better wines are less dense). The effect is more pronounced for sweet wines, which also tend to have higher densities than dry wines.
  • Alcohol shows a positive correlation with quality (more alcoholic wines are better). But over the ‘core’ quality range of most wines (5-7), the effect is more pronounced for sweet wines.
  • Chlorides show a negative correlation with quality, a trend slightly more pronounced for sweet wines.
  • Residual.sugar shows a slight negative correlation with quality for sweet wines (less sweet is better), but for dry wines the relationship is reversed (sweeter ‘dry’ wines are better).
  • Total.sulfur.dioxide shows a clear negative correlation with quality for sweet wines, but limited variation for dry wines.
  • pH shows a clear positive correlation with quality for dry wines (better wines have higher pH), but little variation with quality for sweet wines.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Density and alcohol correlate well (negatively) with each other for both sweet and dry wines, but with a clear separation between the datasets (sweet wines are higher density at given alcohol).

Residual.sugar shows a clear positive correlation with density for sweet wines (sweeter wines are denser), but no clear relationship is seen for dry wines. This may be a by-product of the restricted range of residual.sugar for dry wines (<3) based on the definition of ‘dry’ wines, but the lack of any correlation at all is striking.

Fixed.acidity also shows a positive correlation with density for both sets of wines, again with good separation between the datasets (sweet wines are denser at a given fixed.acidity).

What was the strongest relationship you found?

  • The strongest relationship for quality is with alcohol.
  • The strongest relationship between other features is between density and alcohol.

Multivariate Plots Section

There are only a few wines with quality scores <4 or >8, and they make insights harder to visualise on some plots, so for this section I will create a new set of ‘clipped’ data, with quality scores of 3 and 9 removed. (NOTE there are only 5 wines of quality 9 and 20 of quality 3 out of 4,898 wines in the data)

whitewine.sweet.clip <- whitewine.sweet[whitewine.sweet$quality.factor != 3 
                                      & whitewine.sweet$quality.factor != 9 ,]
whitewine.dry.clip <- whitewine.dry[whitewine.dry$quality.factor != 3 
                                       & whitewine.dry$quality.factor != 9 ,]
whitewine.clip <- whitewine[whitewine$quality.factor != 3 
                                      & whitewine$quality.factor != 9 ,]
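
The three assignments above repeat the same filter. As an alternative sketch, a small helper could apply it to any of the data frames; `clip_quality` is a hypothetical name, not part of the original analysis, and unlike the code above it also drops the factor levels that become empty.

```r
# Hypothetical helper (not in the original report): drop rows whose quality
# score is in `drop`, then discard factor levels left with no rows.
clip_quality <- function(df, drop = c(3, 9)) {
  droplevels(df[!(df$quality %in% drop), , drop = FALSE])
}

# Toy stand-in for the report's data frames (illustrative values only).
toy <- data.frame(quality = c(3L, 4L, 5L, 6L, 7L, 8L, 9L, 6L))
toy$quality.factor <- factor(toy$quality)

toy.clip <- clip_quality(toy)
print(table(toy.clip$quality.factor))
```

Dropping the unused levels keeps empty ‘3’ and ‘9’ categories out of later plots and tables.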

Considering dry wines, it is clear that quality generally improves as alcohol increases and density decreases.

And a similar relationship can be seen in sweet wines.

Comparing the two plots, the pattern is the same across both - with the sweet wines showing a generally higher density.

Looking at other parameters, for dry wines it would appear that for a given density, a higher pH is likely to be a better wine, though the relationship is not strong.

A similar pattern is harder to find for sweet wines, but there does appear to be some indication that wines with higher residual.sugar also need higher alcohol levels to be of high quality.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The correlation between density and alcohol appears valuable in contributing to quality. For both sweet and dry wine datasets, quality tends to increase as alcohol rises and density decreases. Separating the data into sweet and dry datasets allows this relationship to be more clearly seen.

Other variables that reinforce the quality.factor are harder to discern, but it does appear that for dry wines, higher pH (at a given density) might be a useful indicator, and for sweet wines higher residual.sugar (at a given alcohol) might also be a useful indicator.

Were there any interesting or surprising interactions between features?

Nothing that I thought was particularly surprising.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I didn’t create any models, but it does seem likely that model accuracy would be higher if the sweet and dry wines were considered and modelled separately.


Final Plots and Summary

Plot One

Description One

A histogram of residual.sugar, plotted on a log_10 scale, shows the clear bimodal distribution of residual sugar, and the division of the data into ‘sweet’ and ‘dry’ wines at residual.sugar = 3.

Plot Two

Description Two

A view of how density (and its distribution) varies with quality, for sweet and dry wines. Both sweet and dry wines show a clear trend of decreasing density with increasing quality, but at different levels of absolute density for sweet vs. dry wines. Dry wines are generally less dense and show a lower IQR than sweet wines of the same quality. And the variation in average density between ‘good’ (quality = 8) and ‘bad’ (quality = 4) dry wines is smaller than the average density variation between the same quality levels for sweet wines.

Plot Three

Description Three

For both dry and sweet white wines, there is a clear pattern that quality improves as alcohol increases and density falls. The distributions for sweet vs. dry are separate but overlapping (sweet wines are denser at a given alcohol), and so it is clearer to view them separately.


Reflection

This has been an interesting exercise in ‘getting under the skin’ of an unfamiliar dataset, and trying to extract and discern trends and relationships within it.

I spent some time looking at the various analyses before the insight came to me that it could be instructive to separate the data into ‘sweet’ and ‘dry’ wines, and view them separately. The bimodality of the residual.sugar distribution, and knowledge that white wines are commonly classified as ‘sweet’ or ‘dry’ made this feel like a reasonable step. And it does appear that winemakers are driving different aspects of the wine’s chemistry to make a ‘good’ sweet wine vs. a ‘good’ dry wine.

Once it was clear from the histograms of total.sulfur.dioxide and density that there were other differences between the sweet and dry wines, I felt more confident in the decision to consider them separately.

From the univariate analysis, a feature of interest was certainly the slightly anomalous ‘spikes’ in distribution of citric.acid at 0.49 and 0.74. The fact that these are just below some ‘round’ levels (of 0.5 and 0.75) still leaves me thinking there is some sort of targeting of citric.acid below these levels during the wine making process. But I suspect that bringing further insight to this anomaly would need further research and more data than is available in this dataset, and so is outside the scope of this work.

It was disappointing, perhaps, not to see any really strong and clear correlations (especially with quality). But I guess that’s what ‘real world data’ is like. Separating out to sweet and dry did seem to bring out some clarity though, and it became clear looking at the bivariate plots that factors that drive higher quality for sweet wines are certainly not exactly the same as those for dry wines. Alcohol and density remain the strongest drivers of quality for both (but at different levels), but more secondary factors (pH, residual.sugar) clearly vary between the two types.

It seems clear to me that modelling of this data (for quality) is likely to be much more effective if done separately for sweet vs. dry wines. And it could therefore be worth investigating in more detail how to improve this division (perhaps a combination of factors, rather than residual.sugar alone?).